Combining Mixture Components for Clustering.

Authors

  • Jean-Patrick Baudry
  • Adrian E Raftery
  • Gilles Celeux
  • Kenneth Lo
  • Raphaël Gottardo
Abstract

Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K. These clusterings can be compared on substantive grounds, and we also describe an automatic way of selecting the number of clusters via a piecewise linear regression fit to the rescaled entropy plot. We illustrate the method with simulated data and a flow cytometry dataset. Supplemental Materials are available on the journal Web site and described at the end of the paper.
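The hierarchical combining step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the input `tau` is the n-by-K matrix of posterior component membership probabilities from an already fitted K-component Gaussian mixture, and that the "entropy criterion" means merging, at each step, the pair of clusters whose union gives the largest decrease in the entropy of the soft assignment. The function names (`xlogx`, `entropy`, `combine_components`) are hypothetical.

```python
import numpy as np


def xlogx(p):
    """Elementwise p * log(p), using the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = p > 0
    out[mask] = p[mask] * np.log(p[mask])
    return out


def entropy(tau):
    """Entropy of a soft clustering: -sum_i sum_k tau_ik * log(tau_ik)."""
    return -xlogx(tau).sum()


def combine_components(tau):
    """Greedily merge clusters, two at a time, from K down to 1.

    tau: (n, K) array of posterior membership probabilities (rows sum to 1).
    Returns a list of (n_clusters, tau_for_that_solution, entropy) tuples,
    one per number of clusters K, K-1, ..., 1.
    """
    tau = np.asarray(tau, dtype=float)
    path = [(tau.shape[1], tau.copy(), entropy(tau))]
    while tau.shape[1] > 1:
        n_cols = tau.shape[1]
        best = None
        for a in range(n_cols):
            for b in range(a + 1, n_cols):
                merged = tau[:, a] + tau[:, b]
                # Entropy before the merge minus entropy after the merge.
                drop = (xlogx(merged) - xlogx(tau[:, a]) - xlogx(tau[:, b])).sum()
                if best is None or drop > best[0]:
                    best = (drop, a, b)
        _, a, b = best
        merged = tau[:, a] + tau[:, b]
        tau = np.delete(tau, b, axis=1)  # b > a, so column a keeps its index
        tau[:, a] = merged
        path.append((tau.shape[1], tau.copy(), entropy(tau)))
    return path
```

Each entry of the returned path is one soft clustering, so the solutions for every number of clusters up to K can be inspected side by side, and the recorded entropies are the raw values from which an entropy plot could be drawn. The rescaling and the piecewise linear fit mentioned in the abstract are not covered by this sketch.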

Similar resources

Robust Method for E-Maximization and Hierarchical Clustering of Image Classification

We developed a new semi-supervised EM-like algorithm that is given the set of objects present in each training image, but does not know which regions correspond to which objects. We have tested the algorithm on a dataset of 860 hand-labeled color images using only color and texture features, and the results show that our EM variant is able to break the symmetry in the initial solution. We compared...

Clustering and combining pattern of metabolic syndrome components among Iranian population with latent class analysis

Background: Metabolic syndrome (MetS), a combination of risk factors for coronary heart disease and diabetes mellitus, is one of the most challenging public health issues worldwide. The aim of this study was to identify subgroups of participants on the basis of MetS components. Methods: The cross-sectional study took place in the districts related to Teh...

Estimating the Spatial Position of Spectral Components in Audio

One way of separating sources from a single mixture recording is by extracting spectral components and then combining them to form estimates of the sources. The grouping process remains a difficult problem. We propose, for instances when multiple mixture signals are available, clustering the components based on their relative contribution to each mixture (i.e., their spatial position). We intro...

Robust state clustering using phonetic decision trees

The widely used acoustic modeling approach of phonetic decision-tree based context clustering does not take full advantage of limited training data, and therefore fails to produce robust acoustic models. Two problems are identified: (1) all states clustered in a leaf node must share the same set of Gaussian components and mixture weights; no distinction is provided among those states; (2) rarel...

The Geometry of Kernelized Spectral Clustering

Clustering of data sets is a standard problem in many areas of science and engineering. The method of spectral clustering is based on embedding the data set using a kernel function, and using the top eigenvectors of the normalized Laplacian to recover the connected components. We study the performance of spectral clustering in recovering the latent labels of i.i.d. samples from a finite mixture...

Journal title:
  • Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America

Volume 9 2  Issue

Publication date: 2010